In [1]:
from __future__ import print_function, unicode_literals, division
from cytoolz.dicttoolz import valmap
from collections import Counter
import pandas as pd
import json
import gzip
import numpy as np
import dbpedia_config
In [2]:
target_folder = dbpedia_config.TARGET_FOLDER
First, we load a list of English stopwords, and extend it with some stopwords that we found in the dataset while exploring word frequencies.
Note that the stopword list is stored in the file stopwords_en.txt in our target folder (in the case of the English edition).
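The file is just a whitespace-separated word list. A minimal sketch of how such a file could be bootstrapped (assuming the NLTK stopword corpus is available; the actual list in the target folder may have been curated by hand) is:
# Hypothetical bootstrap for stopwords_en.txt; the real file may be hand-curated.
from nltk.corpus import stopwords as nltk_stopwords
with open('{0}/stopwords_en.txt'.format(target_folder), 'w') as out:
    out.write('\n'.join(nltk_stopwords.words('english')))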
In [3]:
with open('{0}/stopwords_{1}.txt'.format(target_folder, dbpedia_config.MAIN_LANGUAGE), 'r') as f:
    stopwords = f.read().split()

stopwords.extend('Monday Tuesday Wednesday Thursday Friday Saturday Sunday'.lower().split())
stopwords.extend('January February March April May June July August September October November December'.lower().split())
stopwords.extend('one two three four five six seven eight nine ten'.lower().split())
len(stopwords)
Out[3]:
We also load our person data.
In [4]:
person_data = pd.read_csv('{0}/person_data_en.csv.gz'.format(target_folder), encoding='utf-8', index_col='uri')
Out[4]:
In [11]:
N = person_data.gender.value_counts()
N
Out[11]:
We also load our vocabulary. We will consider only words that appear for both genders, so that comparing their association with each gender makes sense.
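For context, we assume vocabulary.json.gz is a JSON object keyed by gender, where each value maps a word (or bigram) to its count. A quick sanity check along those lines:
# Sanity check on the assumed layout of vocabulary.json.gz: a JSON object
# keyed by gender, mapping each word/bigram to its count.
with gzip.open('{0}/vocabulary.json.gz'.format(target_folder), 'rb') as f:
    sample = json.load(f)
print(sorted(sample.keys()))                      # expected: ['female', 'male']
print(type(next(iter(sample['male'].values()))))  # counts should be ints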
In [8]:
with gzip.open('{0}/vocabulary.json.gz'.format(target_folder), 'rb') as f:
    vocabulary = valmap(Counter, json.load(f))

common_words = list(set(vocabulary['male'].keys()) & set(vocabulary['female'].keys()))
len(common_words)
Out[8]:
In [20]:
def word_iter():
    for w in common_words:
        if w in stopwords:
            continue
        yield {'male': vocabulary['male'][w], 'female': vocabulary['female'][w], 'word': w}

words = pd.DataFrame.from_records(word_iter(), index='word')
Now we estimate PMI. Recall that PMI is:
$$\mbox{PMI}(c, w) = \log \frac{p(c, w)}{p(c)\, p(w)}$$
where $c$ is a class (or gender) and $w$ is a word (or bigram in our case). To normalize PMI we can divide by $-\log p(c, w)$.
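Written out, the normalized score we compute below (call it nPMI here) is
$$\mbox{nPMI}(c, w) = \frac{\mbox{PMI}(c, w)}{-\log p(c, w)} = \frac{\log \frac{p(c, w)}{p(c)\, p(w)}}{-\log p(c, w)},$$
which bounds the score between $-1$ and $1$.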
In [22]:
p_c = N / N.sum()
p_c
Out[22]:
In [23]:
words['p_w'] = (words['male'] + words['female']) / N.sum()
words['p_w'].head(5)
Out[23]:
In [30]:
words['p_male_w'] = words['male'] / N.sum()
words['p_female_w'] = words['female'] / N.sum()
In [31]:
words['pmi_male'] = np.log(words['p_male_w'] / (words['p_w'] * p_c['male'])) / -np.log(words['p_male_w'])
words['pmi_female'] = np.log(words['p_female_w'] / (words['p_w'] * p_c['female'])) / -np.log(words['p_female_w'])
In [32]:
words.head()
Out[32]:
Now we are ready to explore PMI. Recall that PMI overweights words with extremely low frequencies, so we need to set a minimum-frequency threshold. For instance, in our previous paper we used 1% of biographies as the threshold. This time we have more biographies, and with 1% we do not get 200 words for women.
Hence, this time we lower the bar to 0.1%.
In [84]:
min_p = 0.001
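A quick check (the exact number depends on the data) that this threshold leaves comfortably more than 200 candidate words:
# Words above the frequency threshold; this should exceed 200 so that both
# top-200 lists below are well defined.
print((words['p_w'] > min_p).sum())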
In [91]:
top_female = words[words.p_w > min_p].sort_values(by=['pmi_female'], ascending=False)
top_female.head(10)
Out[91]:
In [92]:
top_male = words[words.p_w > min_p].sort_values(by=['pmi_male'], ascending=False)
top_male.head(10)
Out[92]:
We will save both lists of top-200 words and then manually annotate them into a set of categories.
We will add that categorization in a column named "cat" and process it in the following notebook.
In [93]:
top_male.head(200).to_csv('{0}/top-200-pmi-male.csv'.format(target_folder), encoding='utf-8')
top_female.head(200).to_csv('{0}/top-200-pmi-female.csv'.format(target_folder), encoding='utf-8')
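As a sketch of the hand-off to the next notebook (the assumption here is that the annotated files keep the same names and gain a manually filled "cat" column):
# Hypothetical round trip once the manual annotation is done.
annotated = pd.read_csv('{0}/top-200-pmi-female.csv'.format(target_folder),
                        encoding='utf-8', index_col='word')
print(annotated['cat'].value_counts())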